feat: vf v1 <> nano bridge #2742
Draft
mikasenghaas wants to merge 101 commits into
Points the submodule at the vf-nano EnvServer branch so the orchestrator can build on the env-server abstraction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Switch prime-rl's env path to vf-nano: the orchestrator spawns a vf-nano EnvServer per env (it never loads an environment), dispatches rollouts by task index, and trains on the returned Trace dicts (branches + renderer tokens).
- pyproject: dep verifiers -> vf-nano; drop v1/research env packages; only the vf-nano reverse-text example; override out the transitive v1 verifiers (pulled by the prime CLI) so it can't shadow vf-nano's `verifiers` package; add orjson/pandas/msgspec (were transitive via verifiers).
- EnvConfig inherits vf-nano's swappable agent/runtime (+ max_turns).
- envs.py: spawn EnvServer child + EnvClient, info() for num_tasks/group-scoring, dispatch by task_idx, adapt Trace -> RolloutOutput-shaped dict.
- trajectories.py: trace_to_samples (one sample per Trace branch) + trace_to_output.
- train_source: index sampling; client pool builds vf-nano ClientConfig; lag monitor vendored; env-server entrypoint repointed; ~14 files retyped off vf.RolloutOutput / vf.ClientConfig.
- configs/debug/vf_nano_reverse_text.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er config)
- trace_to_samples stitches each Trace branch's tokens into one TrainingSample
(prompt = branch start, then each turn's new context [masked] + generated
tokens [trained]); drop the RolloutOutput adapter — read the Trace's native
fields directly (reward, error{type,message}, timing generation/scoring,
num_turns, branches).
- envs returns the raw Trace; eval_sink / train_sink / dispatcher / metrics /
orchestrator read native Trace fields (no token_usage/completion/timing.total).
- client pool forwards the shared renderers.RendererConfig to the env server's
renderer client (so it uses qwen3, not the tool-less default fallback).
- debug config: tool_call_parser=hermes (vLLM accepts the agent's tools),
max_steps=20.
- bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
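The stitching described in the first bullet above can be sketched roughly as follows. This is a minimal illustration, not the prime-rl implementation: `Turn`, `Sample`, and `stitch_branch` are hypothetical names, and the real Trace and TrainingSample types in vf-nano and prime-rl may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """Hypothetical stand-in for one turn of a branch (not the vf-nano type)."""

    new_context_ids: list[int]  # tokens appended to the context before generation
    generated_ids: list[int]  # tokens the model generated in this turn


@dataclass
class Sample:
    """Hypothetical stand-in for a stitched training sample."""

    input_ids: list[int] = field(default_factory=list)
    loss_mask: list[bool] = field(default_factory=list)


def stitch_branch(prompt_ids: list[int], turns: list[Turn]) -> Sample:
    """Concatenate a branch into one sample: only generated tokens are trained on."""
    sample = Sample()
    # The branch start is the prompt and is never trained on.
    sample.input_ids.extend(prompt_ids)
    sample.loss_mask.extend([False] * len(prompt_ids))
    for turn in turns:
        # New context (tool results, user messages, ...) is masked out.
        sample.input_ids.extend(turn.new_context_ids)
        sample.loss_mask.extend([False] * len(turn.new_context_ids))
        # Generated tokens contribute to the loss.
        sample.input_ids.extend(turn.generated_ids)
        sample.loss_mask.extend([True] * len(turn.generated_ids))
    return sample
```

Under these assumptions, a branch with two turns yields a single sample whose mask is true only over the generated spans.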
…o timeout)
- Env.run_rollout/run_group pass the vf-nano ClientConfig object and a SamplingConfig (built from the env's sampling args) directly: no model_dump, no per-rollout timeout forwarded to the server.
- debug config: max_steps=20.
- bump deps/vf-nano (typed env-server RPC).
The env server returns a Trace minus its derived fields; the orchestrator resolves
the env's Task subclass (from config.id) and validates the wire dict into a strict
Trace[EnvTask], so the whole orchestrator works with a real, typed vf.Trace —
typed task fields included (e.g. task.answer), nothing subscriptable.
- envs.py: resolve_task_type(env_id); run_rollout/run_group validate -> Trace[EnvTask].
- trajectories/types/dispatcher/train_sink/eval_sink/metrics/filters/advantage/utils
/orchestrator: attribute access on the typed Trace (reward, error{type,message},
branches, timing.<span>.duration, num_turns, ...); derived fields recompute on the
consumer.
- Task/Trace/TimeSpan stay strict (StrictBaseModel) — no extra=ignore anywhere.
- bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orchestrator spawns the env server, so request the serve extra (zmq/msgpack) explicitly now that vf-nano keeps them out of core. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`from __future__ import annotations` already defers all annotations to strings, so the quotes + `# noqa: F821` on the TYPE_CHECKING-only `vf.Trace` / `TrainRollout` annotations are unnecessary (no import cycle — verifiers.nano never imports prime_rl). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The field holds a typed vf.Trace, so `trace` reads truer than `raw` (which suggested an unparsed dict). Renames the field + every `.raw` access, the `emit_rollout(trace=...)` param/kwarg, the to_dict field filter, and the dispatcher cancel-path locals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Drop the FinishedRollout proxy properties (error/reward/is_truncated and the
example_id field); consumers now read r.trace.{reward,is_truncated,task.idx,...}
directly. The trace is the single source of truth.
- Use vf.Trace.has_error for existence checks instead of `.error is not None`.
- Replace the prime-rl trace_* token-length utils with vf.Trace.{completion_len,
total_tokens,has_response} (now on the trace); keep trace_to_samples.
- Carry task_idx end-to-end (GroupState.task_idx, env.run_rollout/run_group(task_idx),
source dict key) instead of the example/example_id dict carrier; identity comes
off trace.task.idx.
- Mark the local-package env arrangement as a temporary/experimental TODO.
- Move the debug config to configs/debug/nano/reverse_text.toml.
- Bump deps/vf-nano (Trace/Turn accessors).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- The env server binds tcp://127.0.0.1:0 and reports its concrete address back over a queue; the orchestrator connects to that. Removes _get_free_port and its TOCTOU race (the OS assigns the port atomically).
- A spawned server has already bound + loaded by the time it reports its address, so the untimed info() is enough; only poll wait_for_server_startup for an external (config.address) server, which has no spawn handshake.
- Bump deps/vf-nano (port report + Trace/Branch token-length accessors).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
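A rough sketch of the bind-then-report handshake, assuming nothing about the actual EnvServer code. It uses plain stdlib sockets and a thread where the commit uses a ZeroMQ server in a child process with an mp.Queue; `serve` and `spawn_and_connect` are hypothetical names. The point it illustrates is that binding port 0 lets the OS choose the port, so there is no window between "find a free port" and "bind it".

```python
import queue
import socket
import threading


def serve(address_queue: queue.Queue) -> None:
    """Hypothetical server: bind port 0, report the concrete address, answer once."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: the OS assigns a free port atomically
        server.listen(1)
        host, port = server.getsockname()
        # Report only after bind + listen, so the address is usable on receipt.
        address_queue.put(f"tcp://{host}:{port}")
        conn, _ = server.accept()
        with conn:
            conn.sendall(b"ready")


def spawn_and_connect() -> bytes:
    """Hypothetical parent side: start the server, wait for its address, connect."""
    address_queue: queue.Queue = queue.Queue()
    thread = threading.Thread(target=serve, args=(address_queue,), daemon=True)
    thread.start()
    address = address_queue.get(timeout=5)
    host, port = address.removeprefix("tcp://").rsplit(":", 1)
    chunks = []
    with socket.create_connection((host, int(port)), timeout=5) as client:
        # Read until the server closes the connection.
        while chunk := client.recv(16):
            chunks.append(chunk)
    thread.join(timeout=5)
    return b"".join(chunks)
```

Because the address is reported only after the socket is listening, the caller needs no startup polling, which mirrors the reasoning in the second bullet.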
The Task-subclass introspection now lives in vf-nano (vf.task_type); drop the prime-rl copy and build the typed Trace via vf.Trace[vf.task_type(env_id)]. Bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SFT trains on a teacher served over the chat client, which returns no token ids, so the trace's turns have tokens=None and trace_to_samples yields nothing. Restore backfill: for each tokenless turn, render its prompt + assistant response with the student chat template and split on the longest common prefix to fill TurnTokens (masks/logprobs come from trace_to_samples). train_sink.process_rollout backfills when any turn lacks tokens, before building samples. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
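The longest-common-prefix split can be illustrated with a small sketch. The function names are hypothetical, and the two token lists stand in for the student chat template's rendering of the prompt alone and of the prompt plus assistant response.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest common prefix of two token-id lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def split_turn(prompt_ids: list[int], full_ids: list[int]) -> tuple[list[int], list[int]]:
    """Split the full rendering into (prompt part, completion part).

    Cutting at the common prefix rather than at len(prompt_ids) tolerates a
    prompt rendering whose tail differs from the full rendering.
    """
    cut = common_prefix_len(prompt_ids, full_ids)
    return full_ids[:cut], full_ids[cut:]
```

For example, a prompt rendering `[1, 2, 3, 9]` and a full rendering `[1, 2, 3, 4, 5]` agree on the first three tokens, so the split assigns `[1, 2, 3]` to the prompt and `[4, 5]` to the completion.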
drop_group's error_rollout_output calls omitted the required task_idx, so an off-policy cancel (on_new_version) raised TypeError. Use the group's task_idx (or -1 when the group is already gone), mirroring handle_completed_rollout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- envs.py: EnvClient now returns Trace[WireTask]; upgrade to this env's real Task subclass via self.trace_type.model_validate(wire.to_wire()).
- dispatcher.py: drop the error_rollout_output helper; inline the synthetic error Trace at each call site using vf.Error's field names (type/message/traceback); the task-exception path carries a real traceback, cancels/empty-trajectory carry none.
- Bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nical
- Spawned env servers now route their output (logging + subprocess-runtime output) to <output_dir>/logs/envs/<name>.log via a _run_env_server wrapper that redirects stdout/stderr and sets up logging in the child. Previously the orchestrator-spawned server logged nowhere.
- Debug config: batch_size 16->128, group_size 8->16, eval num_examples 8->128 (interval=1), matching configs/debug/training_modes/rl.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orchestrator already passes a train/eval-split log_dir (.../logs/envs/train, .../logs/envs/eval), so _spawn must drop the file directly under it (<log_dir>/<name>.log) rather than re-adding an envs/ subdir — which had buried the train/eval split under logs/envs/<kind>/envs/<name>.log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Instead of the orchestrator sidecar-spawning each env server as an mp child, the
rl launcher now spawns one `env-server` process per env (train + eval), each on a
free port, with output to logs/envs/{kind}/{name}.log and a crash monitor — same
model as inference/trainer. It sets env.address in the orchestrator config so the
orchestrator attaches (its existing external path) instead of spawning. Envs that
already set address (user-managed external server) are left alone; the orchestrator's
mp sidecar stays as the fallback for running `orchestrator` directly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add RLConfig.env_server_base_port (default 5000); the i-th launcher-managed env binds base_port + i. Drops the get_free_port dependency in the launcher. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Train envs bind base_port + i; eval envs bind base_port + ENV_SERVER_KIND_STRIDE + i (stride 1000), so each kind has headroom for many envs without the blocks colliding (was a single running index — train and eval sat adjacent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
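The port layout described above amounts to simple arithmetic. A minimal sketch, where `env_server_port` is a hypothetical helper and only `ENV_SERVER_KIND_STRIDE`, the stride of 1000, and the default base port of 5000 come from the commit messages:

```python
ENV_SERVER_KIND_STRIDE = 1000  # gap between the train block and the eval block


def env_server_port(base_port: int, kind: str, index: int) -> int:
    """Port for the index-th launcher-managed env of the given kind (hypothetical helper)."""
    offsets = {"train": 0, "eval": ENV_SERVER_KIND_STRIDE}
    return base_port + offsets[kind] + index
```

With the default base port of 5000, train envs occupy 5000, 5001, ... and eval envs occupy 6000, 6001, ..., so the two blocks collide only if a run has 1000 or more train envs.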
- env_server entrypoint: intercept vf-nano stdlib logging so the server's own logs (EnvServer up, request failures) land in logs/envs/<kind>/<name>.log; previously only loguru output was captured, swallowing them.
- envs.py: close the address-handoff mp.Queue after use (no resource_tracker leaked-semaphore warning on the sidecar path).
- configs/debug/nano/reverse_text.toml: drop the eval block, mirroring examples/reverse_text/rl.toml (train-only smoke; eval path validated separately).
- bump deps/vf-nano (serve/types docstring trim).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…irectly The I/O boundary (save_rollouts + monitor sample tables) now dumps the typed vf.Trace itself (r.trace.model_dump(mode="json")) instead of a Trace+metadata merge — the on-disk rollout is just the trace. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vf-nano renamed its rollout-driver abstraction Agent -> Harness. Update the
integration: EnvConfig.agent -> harness (HarnessConfig/DefaultHarnessConfig);
env.run_rollout/run_group spawn forwards harness_config; the env-server entrypoint
passes harness_config/harness_timeout; debug config uses `harness = {...}`. Bump
deps/vf-nano to the renamed branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add r2e-gym-v1 to the base v1 taskset deps + uv sources (editable from deps/verifiers/examples/tasksets/r2e_gym_v1) so the id resolves through the v1 loader, matching the other -v1 tasksets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- v0 configs/rlm_swe/qwen35_4b.toml: restore the train env to r2e and the eval env to swebench-verified-quick (as on main), reverting the scaleswe switch
- v1: rename configs/debug/v1/scaleswe.toml -> r2e_gym.toml, point the train env at the r2e-gym-v1 taskset, and drop the eval block
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apply the edits the prior rename commit missed:
- v0 rlm_swe/qwen35_4b.toml: train -> r2e, eval -> swebench-verified-quick (as on main)
- v1 debug/v1/r2e_gym.toml: taskset -> r2e-gym-v1, eval block removed
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Env servers spawn their worker pool as fresh `spawn` processes with no logging
handlers (verifiers#1626), so per-rollout logs (rollout start/done, context-exceed
warnings) were silently dropped. Pass `setup_env_server_logging` to verifiers'
`serve_env` as `log_setup`; it runs in the broker and in every worker. A worker
inherits the broker's redirected stdout/stderr, so its logs land in the same
`envs/{train,eval}/<name>.log` as before — no new files or paths.
Bumps deps/verifiers to the worker-logging fix.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Realign the pin onto origin/feat/nano-as-v1 and pick up #1627: the --rich dashboard's token counts fall back to provider usage when the endpoint returns no token ids (no more 0/0). The prior pin 3df34ba5 was a pre-rebase #1626 variant; 955b6cdf already contains the equivalent #1626 (env-server worker logging) plus #1627. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the serve_env SIGTERM-teardown fix: pool/in-process env servers no longer print a spurious KeyboardInterrupt traceback into the env logs on shutdown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1628 (reap the whole subprocess tree when a runtime run is cancelled). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2774)
* feat(v1): elastic env-server pool (inherit pool config from verifiers)
Companion to verifiers#1629. prime-rl's EnvConfig now extends vf.EnvServerConfig, so each env inherits the `pool` discriminated union (static{num_workers=4} | elastic{max_workers=None, multiplex=128}, default elastic) and the orchestrator's env servers scale workers on demand instead of pre-spawning a fixed `auto` count.
- Drop the per-env / train-group / eval-group `num_workers` fields + the auto-resolution (ceil(max_inflight/256)); the elastic pool self-sizes from load.
- envs.py / env_server.py pass `vf.pool_serve_kwargs(env.pool)` to serve_env.
- Bump deps/verifiers to the elastic-pool branch.
Breaking: `num_workers` is replaced by `pool`. Configs set `pool = { type = "elastic", multiplex = N }` or `{ type = "static", num_workers = N }`; the rlm_swe + r2e debug configs are migrated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): back-compat shim mapping legacy num_workers -> pool
EnvConfig forbids extra fields, so configs still setting the removed `num_workers` would hard-fail. Add a `model_validator(mode="before")` that maps it onto `pool`: an int -> a fixed `static` pool, `"auto"` -> the default `elastic` pool; an explicit `pool` always wins. Keeps existing (incl. out-of-tree) configs parsing without edits.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): drop num_workers from rlm_swe + r2e configs (use default elastic pool)
The default `pool` is already elastic (multiplex 128), so an explicit `pool` here was redundant; just remove the legacy `num_workers` and inherit the default.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
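The mapping performed by the back-compat shim can be sketched as a plain function over the raw config dict. This is an illustration under assumptions: the real shim is a pydantic `model_validator(mode="before")` on EnvConfig, `map_legacy_num_workers` is a hypothetical name, and the elastic pool's remaining fields are assumed to fall back to their defaults when only `type` is given.

```python
from typing import Any


def map_legacy_num_workers(data: dict[str, Any]) -> dict[str, Any]:
    """Rewrite a legacy `num_workers` key into the `pool` shape (hypothetical helper)."""
    data = dict(data)  # do not mutate the caller's dict
    if "num_workers" not in data:
        return data
    legacy = data.pop("num_workers")
    if "pool" in data:
        # An explicit `pool` always wins; the legacy key is simply dropped.
        return data
    if legacy == "auto":
        # "auto" maps to the default elastic pool (defaults assumed to be filled in later).
        data["pool"] = {"type": "elastic"}
    else:
        # An integer maps to a fixed-size static pool.
        data["pool"] = {"type": "static", "num_workers": int(legacy)}
    return data
```

Running the rewrite before validation matters because the model forbids extra fields: the legacy key has to be gone by the time the strict check sees the data.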
…_wire validation) Fixes RunRolloutResponse ValidationError 'trace.timing.setup.duration: Extra inputs are not permitted' that crashed every rollout (#1636 drops computed durations from to_wire). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1638 (add --resume for evals: re-run a previous run's missing/errored rollouts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…[WireTask]) (#2781)
* chore(v1): stop importing env modules in the orchestrator
The orchestrator built its per-env trace_type as Rollout[vf.task_type(env_id)] for v1 envs, and vf.task_type imports the env package just to read its Task subclass for typing the wire trace. Nothing reads typed env task fields (only task.idx and a full task.model_dump), and WireTask (extra="allow") preserves those fields (incl. on disk). Always use Rollout[vf.WireTask], so the orchestrator never imports an env package: the env's type and runtime both live only in the server process.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): hoist the constant Rollout[WireTask] to a module-level ROLLOUT_TYPE
It no longer varies per env, so it doesn't belong as a per-instance attribute set in Env.__init__; lift it to a module constant used directly in run_rollout/run_group.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): cap hendrycks-sanity scoring at 10s
Without a scoring timeout (the default is no limit), a wedged math verify holds its rollout's permit forever (sympy can spin past the in-script alarm), and at 512 concurrency that starves the pool and stalls long runs. Set timeout.scoring = 10 on the train and eval envs so the framework cancels and the subprocess runtime kills a runaway verify, freeing the permit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: drop inline comment on the scoring timeout
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
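Why a scoring timeout frees the permit can be shown with a small asyncio sketch. It is illustrative only: `score_with_timeout` and `demo` are hypothetical names, a single-permit semaphore stands in for the concurrency limit, and cancelling a coroutine stands in for the subprocess runtime killing a runaway verify.

```python
import asyncio


async def score_with_timeout(permits: asyncio.Semaphore, verify, timeout: float):
    """Run `verify` under a permit; give up (and release the permit) after `timeout` seconds."""
    async with permits:
        try:
            return await asyncio.wait_for(verify(), timeout=timeout)
        except asyncio.TimeoutError:
            # The wedged scorer is cancelled; leaving the block frees the permit.
            return None


async def demo() -> tuple:
    permits = asyncio.Semaphore(1)  # a single permit makes starvation easy to see

    async def wedged() -> float:
        await asyncio.sleep(3600)  # stands in for a verify that never returns
        return 1.0

    async def quick() -> float:
        return 1.0

    first = await score_with_timeout(permits, wedged, timeout=0.05)
    # Without the timeout above, this second call would wait on the permit indefinitely.
    second = await score_with_timeout(permits, quick, timeout=0.05)
    return first, second
```

Calling `asyncio.run(demo())` times out the wedged scorer, after which the next scorer acquires the permit and completes normally.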
…mports (#2792) Bump deps/verifiers to feat/nano-as-v1 HEAD (8873a740), which includes verifiers#1654 — the v1 interception rework: role-named clients (EvalClient/TrainClient), route-detected wire dialects (chat/responses/anthropic), 1:1 relay + streaming, reasoning preserved. Adopt the renamed client config classes in prime_rl/utils/client.py: OpenAIClientConfig -> EvalClientConfig, RendererClientConfig -> TrainClientConfig. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ts, #1660) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
It was only a manual editable install, so `uv sync` pruned it. Add it to the env dependency group + [tool.uv.sources] (mirroring r2e-gym-v1) so it persists across syncs and is available out of the box. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
verifiers#1653 (carry mm tensors across the env-server wire) is merged and pinned, so `MessageNode.multi_modal_data` is no longer `exclude=True` — `model_dump(mode="json")` now serializes the base64 pixel tensors into `train_rollouts.jsonl` and the wandb sample tables, bloating every line. They're the training `mm_kwargs` carrier, not part of the rollout record, so exclude them at the dump boundary (train + eval paths). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Declare the remaining 7 verifiers v1 example tasksets (code-golf, deepwiki, glossary, swelego, wiki-search, wikispeedia, wordle) as editable deps so uv sync installs every example, matching the verifiers examples set. chromadb/textarena were already present via the v0 wiki-search/wordle envs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The example harness (examples/harnesses/compact) was missing from prime-rl deps, so the documented --harness.id compact branching example failed to resolve (ModuleNotFoundError: harness compact not found). Declare it like the example tasksets so uv sync installs it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Companion PR to PrimeIntellect-ai/verifiers#1576 for verifiers v1 training integration.